Predicting terrorist attacks in the United States using localized news data

Terrorism is a major problem worldwide, causing thousands of fatalities and billions of dollars in damage every year. To address this threat, we propose a novel feature representation method and evaluate machine learning models that learn from localized news data in order to predict whether a terrorist attack will occur on a given calendar date and in a given state. The best model (a Random Forest aided by a novel variable-length moving average method) achieved area under the receiver operating characteristic (AUROC) of ≥ 0.667 (statistically significant w.r.t. random guessing with p ≤ .0001) on four of the five states that were impacted most by terrorism between 2015 and 2018. These results demonstrate that treating terrorism as a set of independent events, rather than as a continuous process, is a fruitful approach—especially when historical events are sparse and dissimilar—and that large-scale news data contains information that is useful for terrorism prediction. Our analysis also suggests that predictive models should be localized (i.e., state models should be independently designed, trained, and evaluated) and that the characteristics of individual attacks (e.g., responsible group or weapon type) were not correlated with prediction success. These contributions provide a foundation for the use of machine learning in efforts against terrorism in the United States and beyond.


Introduction
Terrorism poses a significant threat to society around the world. According to the Institute for Economics and Peace (IEP), in 2021 alone, global incidents of terrorism killed 7,142 people and caused tens of billions of dollars in economic damage [1]. Much of this activity occurs in the Middle East and other parts of Asia, leading many researchers to model the evolution and spread of terrorism in such regions. However, other areas of the world-including the United States-are not exempt from the effects of terrorism. Between 2015 and 2018, the Global Terrorism Database (GTD) recorded 229 incidents of terrorism in the United States [2]. The IEP also recently noted that the United States experienced one of the largest decreases in Positive Peace Index, a change that indicates increasing social disorder and greater risk of politicallymotivated violence [3]. Predicting such violence before it occurs could enable prevention strategies and save lives. a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Several prior works address the problem of detecting terrorist activity, such as classifying pro-terrorism tweets [4]. However, in this study, we are interested in the problem of prediction rather than detection. Additionally, our task is to predict the occurrence of attacks or events, which is distinct from works that have sought to infer characteristics of an attack, such as the responsible group, after it has taken place [5][6][7][8][9]. Few works have attempted to predictively model terrorist attacks, with the recent work of Python et al. being one exception [10]. The authors trained several machine learning models using prior terrorism data (from the GTD) in conjunction with other geographic and socioeconomic features to predict attacks at discrete spatiotemporal intervals. Despite promising results and thorough analysis at an impressive and global scale, we found two fundamental problems with their approach: 1. Coarsely evaluating a model that represents a large region (e.g., West Africa) but comprises many smaller geographic cells overlooks the imbalanced spatial distribution of attacks. In this case, which is more complex than evaluation on a typical class-imbalanced dataset, measures like area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) can be misleading. For example, using the data described in Table 1 and Fig 1, we can create a terrorism model for the United States that predicts an attack will occur every day in five states (CA, NY, TX, FL, and WA) but that no attacks will occur in other locations. If evaluated coarsely across the entire United States, this model would produce an AUROC of 0.733 and AUPRC of 0.468, both of which significantly outperform their respective baselines of 0.500 and 0.003. However, these outputs provide no value beyond reflecting that some regions have experienced a greater number of attacks. We avoid this problem by evaluating predictive performance for each region (state) separately.
2. In order to make meaningful predictions at a granular timescale (e.g., will a terrorist attack take place during a given week), models must have inputs of a similar temporal granularity in order to differentiate between points in time. Many of the inputs utilized by Python et al. in [10] are relatively static, such as population density and gross domestic product (GDP). The only temporally granular inputs were autoregressions based on local terrorist activity (i.e., whether there were recent terrorist attacks in the location of concern). While these autoregressions are valuable in their ability to model continuous terrorist activity, without AK, AL, AR, HI, ME, NH, OK, RI, SD, VT, WV 0 -other temporally relevant inputs the predictive models will be fundamentally limited in their ability to predict attacks at new or unusual locations. This shortcoming could also exacerbate the coarse evaluation problem described above.
Buffa et al. also frame terrorism as a binary classification problem, using a number of geospatial and population features to predict the location of terrorist attacks in Europe [11]. However, with respect to future predictions, their methods suffer from the same problems described above.
In contrast to the treatment of terrorism as a classification problem, a number of prior studies have modeled terrorism as a spatiotemporal process within a complex dynamical system. This body of work has generally been concerned with modeling the evolution of terrorist organizations, understanding the escalation of violence, or quantifying the lethality or number of attacks within a geographical region and/or over a period of time [12][13][14][15][16][17][18]. In one of these studies, Python et al. focused on modeling the continuous phenomena (e.g., lethality and frequency) exhibited by terrorism in the Middle East [19]. Li et al. similarly considered terrorism as "an unstable system of interdependent events" and treated it as a type of crime [20]. Prior crime prediction models have had some success when applied to urban, high-crime areas like Chicago and New York City [21,22], but these solutions do not necessarily translate to predicting terrorism, which is much more narrowly defined [2]. Although Li et al. applied their crime prediction model to terrorism the Middle East, neither theirs nor the other aforementioned works have considered terrorism in the United States, where attacks are much more sparse and the dynamical processes consequently more difficult to observe. For example, according to the GTD Iraq experienced 2,466 terrorist attacks in 2017, 1,154 of which were perpetrated by a single organization: the Islamic State of Iraq and the Levant (ISIL). However, in the same year, the United States experienced just 65 attacks, only five of which were perpetrated by a known organization [2]. Models that rely on systematic patterns exhibited by organizations or interdependence between events are unlikely to succeed in such a context.
Another vein of related work is the use of news data to predict social unrest. Qiao et al. used mentions of protests encoded in the Global Database of Language, Events, and Tone (GDELT) [23] to predict social unrest events in Southeast Asia [24]. Galla et al. extracted a set of longitudinal features from GDELT, which they used to predict social unrest events for a given location on a given date [25]. GDELT has also been recently utilized in other related applications, including modeling global peace [26].
In context of both the advances and limitations of these prior studies, we present a machine learning framework for predicting terrorist attacks in the United States using the GTD in conjunction with news data from GDELT. Our key contributions include the following: 1. We are, to the best of our knowledge, the first to use machine learning to predict the occurrence of individual terrorist attacks in the United States. We demonstrate that by framing terrorism prediction as a binary classification problem within discrete spatiotemporal units (i.e., did an attack take place on day X in location Y?) rather than a continuous process (i.e., the evolution of terrorist organizations over time, or the growth of terrorist activity in a region), we can model terrorism in a context where attacks are much less frequent and systematic than other regions that have previously been the focus of predictive tasks.
2. We show that large-scale news data can be successfully utilized in terrorism modeling. This is particularly noteworthy in a context where historical terrorism data is extremely sparse, and suggests many avenues for future work, such as incorporating news data into other difficult machine learning problems and studying the relationship between news trends and terrorist activity.
3. We present a simple but novel feature representation method, K-S Moving Average, which uses a Kolmogorov-Smirnov test [31] to choose an optimal observation window for each feature. The best model (a Random Forest using the K-S Moving Average) AUROC scores � 0.667 on four of five states (p � .0001 w.r.t. random guessing), outperforming other algorithms and representation methods-including several neural networks-on the classification problem.
4. We thoroughly evaluate a set of baselines and proposed models on the five states impacted most by terrorism between 2015 and 2018-New York, California, Texas, Florida, and Washington-and discuss the implications of our results for the future application of machine learning to the study of terrorism.

Materials and methods
In this section, we first formulate the problem of predicting a terrorist attack at a given location and time. We then describe our methods for data collection and preprocessing. Finally, we introduce the relevant learning algorithms and approaches to feature representation. Fig 2 outlines our methodology.

Problem formulation and notation
We formulate the task as a classification problem such that our goal is to learn a function f: x ! y, where x is a set of input features and the binary output y represents the occurrence or non-occurrence of a terrorist attack at a specific and on a given date. We consider as locations the set of 51 states in the United States (including Washington D.C., excluding Puerto Rico and other non-state territories), each with m = 862 features observed over n = 1, 413 calendar days from February 18, 2015 through December 31, 2018, inclusive (as limited by data availability, please see below for more details). Let X 2 R n�m be a feature matrix andỹ ¼ f0; 1g n be a label vector, such thatx i and y i denote the feature vector and label, respectively, that describe the i th day at the location of concern. Further, we use X i,j to represent the i th observation of the j th feature, X i:k to represent the i th through k th (inclusive) observations of all m features, and X� ,j to represent the column vector that contains all observations of the j th feature. Unless stated otherwise, we treat locations as independent such that each has a distinct feature matrix X and label vectorỹ. Finally, we use square brackets [] to denote vectors.

Data collection
Feature extraction. The Global Database of Events, Language, and Tone (GDELT) monitors worldwide print, broadcast, and online news in over 100 languages [23,27]. It contains two primary sources of information: the Global Knowledge Graph (GKG) and the Event Database. The GKG comprises over 10 TB and grows at a rate of over 500,000 records every day. Each record represents publishing information, mentions of persons and locations, textual themes, and other metadata extracted from each published article, video, or other item of news. The Event Database utilizes the CAMEO event coding framework to identify and categorize events from news in the GKG. A CAMEO event consists of an actor-an individual, country, identity group, or some other sociopolitical entity-performing an action on another actor [28]. Each action is classified under a taxonomy that includes categories such as "acknowledge or claim responsibility" (code 015) and "reject request for military aid" (code 1122). The Events Database records each event along with other extracted information, including the geographical location of the event and/or actors, political and religious affiliations for each actor, an estimate of the event's political significance, and the average sentiment of news items associated with the event. Taken together, the GKG and Events Database represent GDELT's effort to construct a global representation of "what's happening and how the world is feeling about it" [27]. We utilized version 2 of the GKG and Events Database, which were both released on February 18, 2015.
We utilized data from both the GKG and Events Database to construct the feature matrix for each location, in a manner similar to Galla et al. [25]. From the GKG, we considered 283 textual themes that are described in the GKG Category List [29]. While other themes have been added to the GKG, the aforementioned are the best-documented have been clearly recorded since the release of GDELT version 2. For each theme, we extracted all the news records associated with that theme that also mention the given location. To compute the first set of 283 features, which we call theme counts, we simply counted the relevant records in the GKG. For the next 283 features, which we call theme sentiments, we computed the average sentiment score of these records for each date. From the Events Database, we grouped all records by the 148 "base codes" at the second level of the CAMEO taxonomy [28]. The next set of 148 features, which we call CAMEO counts, is the count of news items associated with each event code for the given location, aggregated by publishing date. Our final set of 148 features, which we call CAMEO sentiments is the average sentiment score for all articles associated with each of the CAMEO codes, also aggregated by publishing date. In total, we considered 862 features, which collectively represent a location's sociopolitical climate on a given day.
Label extraction. The Global Terrorism Database (GTD), created and maintained by the University of Maryland, is an open-source database containing information on over 190,000 terrorist events around the world from 1970 through 2018 [2]. The GTD defines a terrorist attack as "the threatened or actual use of illegal force and violence by a non-state actor to attain a political, economic, religious, or social goal through fear, coercion, or intimidation" [30]. According to its codebook, an event must meet the following requirements to be classified as a terrorist attack and recorded in the GTD: 1. All three of the following criteria must be met: 1. The incident must be intentional.
2. The incident must entail some level of violence or immediate threat of violence.
3. The perpetrators of the incident must be sub-national actors.
2. Two of the following three criteria must be met: 1. The act must be aimed at attaining a political, economic, religious, or social goal.
2. There must be evidence of an intention to coerce, intimidate, or convey some other message to a larger audience (or audiences) than the immediate victims.
3. The action must be outside the context of legitimate warfare activities.
We considered all terrorist events that occurred in the United States on or after February 18, 2015-the release date of GDELT version 2. Of these 229 events, shown in Fig 1, 83.8% were successful, meaning that the attack produced some kind of tangible effect (such as an explosion), even if there were no casualties or significant consequences. 19.1% of the attacks resulted in at least one fatality, and in 27.2% at least one person was wounded. If multiple attacks were carried out in a continuous period of time or location, the GTD records them as a single incident. However, if either the time or location was discontinuous (even if the attacks were carried out in close proximity or by the same perpetrators) the GTD records them as multiple incidents. After accounting for this overlap in the 229 events, we extracted 208 unique location and date pairs as positive labels (y i = 1) for the classification problem. We treated all location and date pair not recorded in the GTD as negative labels (y i = 0).

Observation windows
Recall that our goal is to predict future terrorist attacks. However, the features and labels described above represent news activity and the occurrence of an attack, respectively, for the same day. In a real-world scenario, the news features for a given day would not be available until after the attack occurred. Additionally, the features only consider news activity for a single day; this precludes the discovery of any long-range temporal patterns. In other words, in their raw form the GDELT features can only be used same-day terrorism detection, but not prediction.
To address both of these shortcomings, we performed an additional preprocessing step to generate new feature representations for each date and location pair using an observation window [25]. This window, represented by the hyperparameter Δt � 1, is the number of prior days for which news features are observed when predicting an attack. As outlined in Fig 2, we evaluated three distinct types of observation windows: a fixed-length moving average, a variable-length moving average, and a stacked representation. Each approach is detailed below.
Fixed-length moving average. In the simplest case, for each feature X� ,j , we considered the unweighted mean of the previous Δt days, i.e.
Note that when Δt = 1, the representation for a given location and date is simply the news features as observed from the previous day. Increasing Δt allows us to observe a greater number of prior days and has the additional benefit of smoothing noise. However, it also blurs the distinctions between days and causes each feature to no longer be independently distributed; we address this problem in our discussion of Experimental Setup. Variable-length (K-S) moving average. A significant limitation of the fixed-length approach is that it assumes a static value of Δt for all features. We addressed this by implementing a more flexible version of the observation window, which chooses a preferred representation for each feature according to a two-sample Kolmogorov-Smirnov (K-S) test [31]. The K-S test computes the distance between the empirical distribution function (EDF) of two samples in order to test the null hypothesis that the two samples are drawn from the same distribution. Intuitively, we use the K-S test to choose a representation for each feature that maximizes the difference between the EDF for both classes for that feature. The full procedure is detailed in Algorithm 1. In summary, we first specify a maximum observation window Δt � . Next, for each feature we compute its moving moving average (Eq 1) for each t � Δt � . Then for each feature and value of t we perform a K-S test, where the first sample is the set of observations for that feature for non-events (y i = 0) and the second sample is the set of observations for events (y i = 1). Finally, we choose the representation for each feature (i.e., value of t) that maximizes the distance between the empirical distribution functions of events and nonevents (i.e., returns the minimum p-value).

Algorithm 1 K-S Moving Average
Require: A feature matrix X 2 R n�m , a label vectorỹ ¼ f0; 1g n , a maximum observation window Δt � � 1 Ensure: A processed feature matrix X 1: for j 1 to m do 2: p min 1.0 ⊳ Store the minimum p-value 3: a X� ,j ⊳ Store the original observations for feature j 4: for t 1 to Δt � do 5: a 0 MA(a, t) ⊳ Compute the current moving average via Eq 1 6: if p < p min then 8: p min p ⊳ Store the new minimum 9: X� ,j a 0 ⊳ Replace the values for feature j 10: return X This procedure requires computing a moving average over n instances, which takes O(n) time, exactly Δt � times for each of the m features. Therefore, the time complexity of Algorithm 1 is O(Δt � nm). Since the window is computed independently for each feature, the algorithm can be easily parallelized. Fig 12 in S1 Appendix reports empirical runtimes on the data used in this study.
Stacked representation. Both moving average approaches described previously are unweighted and thus assume each of the values within the observation window are equally important. While this may be suitable for simpler learning algorithms, neural networks are well-equipped to discover more complex patterns by generating their own (hidden) representations. Therefore, when serving input to neural networks we do not precompute representations using either of the moving average approaches described above; instead, we represent each instance as stacked feature vectors from the previous Δt days, i.e. X i−Δt:i−1 for some arbitrary day i. By learning a mapping from these stacked inputs to a hidden representation, the neural network is essentially learning its own observation window. We discuss this in more detail in the following section.

Learning algorithms
Recall that our task is to learn a function f: x ! {0, 1} for some input x. Toward this end we explored two families of machine learning algorithms, ensembles and neural networks, which we detail below.
Ensembles. An ensemble is combination of multiple classifiers. When individual classifiers are independent and diverse, the ensemble can mitigate the weaknesses and limitations of a single classifier. This characteristic has enabled ensembles to provide state-of-the-art performance on a number of machine learning tasks [32]. We evaluated the following three ensembles: 1. XGBoost: a tree-based ensemble that introduced a number of innovations to gradient boosting algorithms. XGBoost has achieved state-of-the-art performance on a number of difficult problems [33].
2. Random Forest: a collection of unpruned decision trees, each grown from a random and uniformly sampled subspace (of features), that uses a weighted voting mechanism to classify each instance [34]. The Random Forest has found success in a range of machine learning tasks and established itself as a strong general-purpose ensemble [32].
3. AdaBoost: a collection of decision stumps that are each trained on a single bootstrap (a small sample drawn with replacement from the training set). AdaBoost's key innovation is that it iteratively updates the sampling distribution, increasing the probability of selection in the next bootstrap for instances that were misclassified by the previous decision stump [35]. This procedure intuitively pushes the ensemble to "focus" on more difficult instances; however, it also makes AdaBoost very sensitive to noise.
Neural networks. A neural network propagates an instance through layers of nodes, or neurons, to compute a label. Training a neural network is the process of learning the set of edge weights that minimizes error on the training data with respect to some loss function. The power of neural networks lies in their use of various configurations of hidden (intermediate) layers to generate new representations of the input. As such, neural networks have become state-of-the-art in many complex learning tasks-especially those that involve learning nonlinear functions [32]. A neural network is a function that classifies an input x according to the following: where W is a real-valued weight matrix andb is a real-valued bias vector that are both learned via backpropagation; σ is the sigmoid activation function, and h(x) is the learned hidden representation of the raw input x. In this context, what distinguishes classes of neural networks is how they compute h(x). For brevity and in this section only, we use X i−Δt:i−1 to represent the matrix form of the stacked representation for an arbitrary day i and observation window Δt � 1. Given this definition of X 0 we utilized the following neural network functions: 1. Feedforward Neural Network (FFNN): a basic architecture in which inputs are propagated strictly forward from the input layer, through a series of L hidden layer(s), and finally to the output layer. We utilize two distinct FFNN architectures: 1. One hidden layer (L = 1, not pictured): computes h(X 0 ) via a single dense layer, i.e.
hðX 0 Þ ¼ ReLUðW 0 ½X 0 �;1 ; ::: where ReLU is the rectified linear unit activation function, W 0 2 R m�k is a learned weight matrix with k (a hyperparameter) neurons in the hidden layer, ½X 0 �;1 ; :::; X 0 �;m � represents the vector concatenation of all observations of all features in X 0 , andb 0 2 R k is a learned bias vector for the hidden layer.
2. Two hidden layers (L = 2, pictured in Fig 2): computes a hidden representation for each feature independently, then uses each feature's new representation to compute a joint hidden representation via a dense layer: hðX 0 Þ ¼ ReLUðW 0 ½gðX 0 �;1 Þ; :::; gðX 0 �;m Þ� þb 0 Þ; ð4Þ Here W j is the weight matrix and b j the bias vector for the j th feature.
2. Long Short-term Memory (LSTM): an architecture in which inputs are propagated through a series of recurrent layers in order to learn sequential dependencies [36]. LSTMs have found success in time series classification, machine translation, and a number of other tasks in which the input can be modeled as a sequence. We use a standard sequential LSTM to compute a hidden representation for each day t < Δt according to: where tanh is the standard tanh activation function andx 0 t is the feature vector for day t in X 0 .
We experimentally evaluated several other state-of-the-art neural network architectures, including models based on attention [37] and convolutions, which are often used in time series classification and have the benefit of being translation invariant [38]. However, we found that these additional architectures performed empirically worse than both the FFNN and LSTM, so we do not consider them in the rest of this study.

Addressing class imbalance
When trained on imbalanced data, the predictions of machine learning models can be biased toward the majority class. We utilized the following techniques to address this problem: SMOTE. Synthetic minority oversampling technique (SMOTE) generates artificial examples of the minority class in order to balance the training set [39][40][41]. We utilized SMOTE in all the ensembles by oversampling the minority class (events) such that both classes had equal representation.
Weighted cross-entropy. Because backpropagation trains a neural network by computing the gradient with respect to a loss function, a useful technique in imbalanced settings is to weigh more heavily the loss contributions from the minority class. Toward this end we utilized a weighted version of cross-entropy (log loss). Consider the standard definition of binary cross-entropy, i.e.
where p(y i ) is the probability output by the model that example y i = 1 (is an event). We added a term a ¼ P n i¼0 y i n to weigh the contribution of each training example by the inverse of its class representation: This weighting allows for the set of events and the set of nonevents to each contribute half to the potential loss.
We explored a number of other options for addressing class imbalance, including underand over-sampling, sample and class weighting, and state-of-the-art loss functions for imbalanced classification [42,43]. However, we found that SMOTE and weighted cross-entropy provided the best performance in almost all cases for the ensembles and neural networks, respectively, so all results presented below use these techniques. Fig 11 in S1 Appendix summarizes the performance of other strategies.

Experimental setup
We first evaluated each model on the states that had experienced the greatest number of attacks: New York, California, Texas, Florida, and Washington (Table 1). While we also evaluated additional locations, our results demonstrate that in these cases the attacks were too sparse to learn effective models. Therefore, we dedicate most of our analysis to these five states.
As baselines, we evaluated a Random Guesser (assigned a random probability to each testing example), a single Decision Tree, a support vector machine (SVM) using an RBF kernel, and a Logistic Regression classifier. For all tree-based methods we used Gini index as the splitting criterion, and for each ensemble we used 3,000 estimators. We trained all neural networks for 100 epochs using stochastic gradient descent with a learning rate of 1e −4 , a decay parameter of 1e −6 , a batch size of 32, and Nesterov momentum. All hidden layers (except the first hidden layer in the two-layer FFNN architecture) contained 8,000 neurons, and each hidden layer in the LSTM model contained 1,024 neurons. For the neural network models we used min-max normalization to scale the input. Beyond computing the observation windows, no further preprocessing was required for the ensembles. We tested the fixed-length moving average approach with the ensemble methods and the stacked representation with the neural networks. However, due to the success of the K-S moving average (Δt � ) with the ensembles, we also evaluated this representational method with the neural networks. We evaluated each model using 5-fold cross-validation. This was straightforward for the neural network models and when Δt = 1, since each observation was independent. However, when using either moving average approach we considered the following in order to maintain proper separation between the training and testing sets: 1. When testing the variable-length moving average (Δt � ), we used only the training data to compute the optimal observation windows for each feature. We then computed moving averages on the testing set using the same window lengths that were learned from the training set. Finally, we measured model performance on each testing fold using AUROC. An ROC curve plots a model's true positive rate against its false positive rate at various discrimination thresholds, and is thus a useful measure of classification performance in an imbalanced setting. An AUROC of 1.0 indicates a perfect classifier, while a random guesser converges to 0.5. We  Table 2 summarizes the performance of the models that produced statistically significant predictions on at least three of the five states that experienced the most attacks. Table 3 shows the classification results for observation windows of 1 and 14 days for the baselines and ensembles, and 1 and 7 days for the neural networks. While some other observation windows produced better results for individual states, and as we discuss below, many models were very sensitive to the value of this parameter. Those shown in Table 3 produced at least one model that was strong across several states, and thus seemed the most useful and representative for evaluating how well different types of algorithms can perform on this problem. The standard deviation for each complete 5-fold cross-validation experiment was small (< 0.025 in all cases), but the inter-fold standard deviation was high in many cases. For example, on New York, the Random Forest (Δt � = 14) model consistently performed well on the first testing fold (.726 ± .036) but struggled on the fifth fold (.615 ± .036). This is not surprising, given that individual events are difficult to predict and that the small number of events in our data set means it is less likely that each training and testing fold will reflect the true distribution. In spite of this, we found that multiple models achieved results that are noteworthy and significantly better than random.

Results and discussion
The Random Forest with Δt � = 14 achieved an AUROC � .667 for every state except California, on which all the ensembles performed poorly. The neural networks performed better on California, but this was exchanged for poor performance on New York. The neural network that performed best on average was the FFNN with Δt = 7, L = 1, which achieved AUROC � .650 on three of five states. The other neural networks performed well on two states-California and Washington for the FFNN with Δt � = 7, L = 2, and Texas and Florida for the LSTMbut poorly on the others. Except for the SVM on California, none of the baselines performed well. Overall, we consider the Random Forest (Δt � ) to be the strongest model, followed closely by XGBoost (Δt � ), which produced very similar results on four of five states. The greatest difference was on Texas; however, this difference in AUROC may not be significant (p = 0.069), so we consider both models to be roughly equivalent in terms of overall performance. Of the neural networks, the FFNN (L = 1) produced statistically significant results on the greatest number of states (3). Since they represent the strongest overall performance of each family of models, we focus the remainder of our discussion on the performance of the Random Forest and FFNN (L = 1) and refer to them as RF and FFNN, respectively.  FFNN. Ideally, a model should be robust to small changes in the observation window and benefit (or at least not suffer) from increasing its length. We note that when using the K-S Moving Average (Δt � ) representation, the trend of RF's performance is relatively stable with respect to increased window length. FFNN and the standard Moving Average (Δt), on the other hand, are both relatively sensitive to small changes to the observation window. This instability is likely attributable to noise, and suggests that RF with the K-S Moving Average is less prone to overfit. From this result, we additionally note that the trend and optimal window varies between states. While this suggests that a predictive model may need to be tuned separately for each state, it is also possible that some of this variation is due to noise and/or only having few events per state.

Temporal locality of attacks
Some attacks recorded in the GTD are clustered in time and space. For example, in 2016 attacks were carried out in New York on August 9 (anti-Semitic extremists detonated explosives outside the houses of two Rabbis in New York City), August 10 (an arsonist targeted a private residence in Endicott), and August 13 (a man shot and killed an Imam and his assistant outside a mosque in New York City). Additionally, there are a few attacks that spanned a number of days and/or targets, such as in October 2018, when pro-Trump extremists sent pipe bombs to five different targets on four different dates. To test whether temporal locality had biased the model outputs, we recorded for each attack the predicted probability (P(y = 1)) and the number of days that had elapsed since the previous attack in that state. Fig 4 visualizes this distribution for RF and FFNN along with a regression line for each state. These results, along with the correlation coefficient from Spearman's rank test (r s ), indicate that RF still performs well on many of the attacks that are not close together in time, which is an important characteristic of a robust model. FFNN performed more poorly on attacks that were farther apart in time, which likely contributed to its weaker overall performance.

Number of attacks in training and testing
The fact that our data set is both small and highly imbalanced is almost certainly a limiting factor in model performance. Additionally, the temporal nature of our data limits our ability to use stratified cross-validation, so the events are not distributed evenly across training and testing folds. Fig 5 plots the average AUROC of each testing fold as a function of the number of events in the corresponding training fold for RF and FFNN, respectively. When computed across all states, the line of best fit and Spearman's rank coefficient (r s ) for these results  surprisingly suggest a negative correlation-i.e., that a greater number of events in the training and testing sets is correlated with worse performance. However, in the case of RF, we can see that the folds from California-on which RF's predictions were close to random guessingcontribute significantly to this result since they contain many events. Without California, r s increases from −0.241 to 0.136 for RF-the positive correlation we would expect. Continued efforts to collect both news and terrorism data will very likely drive further performance gains. However, for both RF and FFNN it seems that the lack of events or data is not the only limiting factor. It is likely that data quality and/or noise are additional barriers, ones that are more pronounced given the small number of events.

Feature types
Measuring the predictive power of each group of features can help us understand model performance as well as identify areas for future improvements. Toward this end, we retrained RF and FFNN using different combinations of feature groups. Fig 6 shows the baseline classification results on each state along with results from four additional feature-filtered models, and is the basis for the following observations: 1. In general, the importance of each feature group varied between states. This affirms the need to consider a unique model for each state.
2. The only feature group that improved performance in all cases was CAMEO counts. Further, in most states the model without CAMEO counts performed the worst. This implies that CAMEO counts are, on average, the most potent feature group.
3. In some cases, dropping a feature group improved model performance; for example, RF's predictions on California. This is further evidence of the impact of noise on classification results, and creates an important point of investigation for any future work that would seek to optimize predictions for a particular state or location.

Characteristics of attacks
Terrorist attacks vary widely with respect to responsible party, target, method and other properties. Some types of attacks may be easier or harder to predict, or even characterized by

Prediction windows
Our baseline task is to predict the occurrence of a terrorist attack in a given state on a single day. Some prior work found success in making predictions over coarser periods of time, such as a week [25]. However, we found this was not the case for our problem. We evaluated the following three methods for performing predictions over a given prediction window Δp > 1: 1. Label propagation: we propagated event labels backward to the previous Δp observations. For example, given an attack on Jan. 3 and Δp = 3, we labeled Jan. 1, 2, and 3 as attacks.
2. Date aggregation: we aggregated the news features from Δp dates into a single observation. For example, given Δp = 3, we combined Jan. 1, 2, and 3 into a single observation: Jan. 1-3. We used mean pooling to generate the feature values for each aggregated observation.

Learning from multiple states
One of our key challenges is a small and imbalanced data set. While our results thus far have shown that each state should be modeled independently, we also hypothesized that it could be possible to transfer knowledge between models. To evaluate the feasibility of such an approach, we trained a new set of models by combining data from multiple states. For example, when testing the New York model at baseline, we only included the training examples from New York; however, in these experiments we also supplemented the training set with additional data from another state (e.g., California) before evaluating on the New York testing set. As the heatmap in Fig 9 shows, this only proved beneficial on average for California. This is not Classification performance for models using larger prediction windows. The performance of both label propagation (top) and date aggregation (bottom) suggests that using coarser date representations hinders the models' ability to discriminate between classes. In each plot, AUROC is shown as a function of Δp (the prediction window) for the given state. Shaded regions represent the standard deviation across testing folds.
https://doi.org/10.1371/journal.pone.0270681.g008 surprising given our previous results, which have suggested that the importance of feature groups and the optimal observation windows vary between states, and further evidences the need for models to be tailored to individual states.

Predictions on additional states
Until this point, we have limited our experiments and analysis to five states. While we also wanted to evaluate our model on other states, none of them experienced more than seven recorded attacks (Table 1). We found that with so few minority class examples it was impossible to learn an effective model. In an attempt to address this challenge, we considered the eight Change in classification performance for models trained on multiple states. Each cell in the heatmap represents a unique model, with states on the x-axis representing the testing location and states on the y-axis representing the state that supplemented the training data. The value in each cell is the change in AUROC compared to the baseline model. For example, the cell (NY, +CA) represents a model tested on New York data and trained on a combination of New York and California data, and the value −0.232 means that this model produced an AUROC 0.232 less than the baseline model that was only trained and tested on New York data (Table 3). Avg represents the average change in AUROC for the corresponding row or column.
https://doi.org/10.1371/journal.pone.0270681.g009 states that had experienced between five and seven attacks, and used hierarchical clustering (Euclidean distance, average link) to group them based on the observed news features. We tested two different grouping methods: states, which almost entirely eliminates the coarse evaluation problem we previously described. Fig 10 presents the results from single-state testing using event thresholds between 6 and 16. Table 4 shows, for each testing state, the other states that were most similar and thus clustered, along with the results of group testing. The inconsistent and overall poor performance of both single-state testing and group testing is not surprising given our findings that supplementing the training data with examples from additional states reduced model performance, even though in this case we only included the most similar states. This suggests that in order to make meaningful predictions on these states, we either need more examples of attacks or a model that can account for more nuanced differences between states.

Limitations
Key limitations of our study include the following: 1. Data availability. The window of overlap in data availability from GDELT and the GTD (1,413 days) is relatively small. This fact, coupled with other issues like the difficulty of the prediction problem and a noisy feature space, limits performance and introduces more variance to the modeling process. Further, since many states do not have a sufficient number of recorded attacks, the majority of our study is limited to five states.
2. Coarse location modeling. While we have shown that modeling terrorism prediction as a binary classification problem can be fruitful, our treatment of locations as entire states (on the basis of data availability) is a coarse abstraction of real-world locations. For example, New York is a large state and most of the attacks have taken place near New York City. Further, our methods do not account for geographical relationships, e.g. the fact New York City shares a large border with New Jersey (NJ). This limitation is strongly related to data availability.
3. Shifts in feature and class distributions over time, especially within news data. Although news data from GDELT has proven useful in this study, our framework assumes the distribution of these features will remain relatively stable over time. However, this may not be the case, as the sociopolitical landscape of any location can shift for a number of reasons beyond terrorist attacks.

Conclusion
Terrorism presents a serious threat to global society. In this work, we used a series of machine learning models trained on localized news data to predict the occurrence of terrorist attacks in a given state and on a given day-a problem that, to the best of our knowledge, no prior work has attemped to solve. Our best model-a Random Forest that uses a novel Kolmogorov-Smirnov Moving Average method for feature representation-outperforms deep models and achieves an AUROC � 0.667 on four of five major states (p � .0001 as compared to random guessing) while being relatively insensitive to characterstics such as the amount of time between attacks or the type of attack. These results demonstrate that terrorist attacks can be predicted at a rate better than random, and show the potency of a multimodal approach in using news data to characterize a spatiotemporal location. Our results further indicate that the key factors limiting model performance are noise and the small number of attacks in the data set. Finally, our results affirm the need for localized models; in our case, manifested by creating separate models for each state. We envision a number of areas for future work, including: 1. Machine learning models that can better account for noise, either via manual feature engineering or noise-aware learning algorithms, and the differences between states.
2. The application of news data to other multimodal learning problems.
3. Continued efforts toward improving both the quality and quantity of terrorism-related data, including incorporating more sophisticated data augmentation strategies [46, 47].
4. Deeper studies into the relationship between news and specific terrorist attacks, such as investigating whether attacks can be traced to news topics like political events or socioeconomic themes.
Continued research in the use of machine learning to predict terrorist attacks promises to further our understanding of terrorism, inform strategies for mitigating or even preventing the attacks, and save lives.
Supporting information S1 Appendix. Supplemental evaluation. Includes the performance of other class imbalance strategies, runtime of K-S Moving Average, and analysis of the observation windows computed by K-S Moving Average. (PDF) S1 Table. p-values for model results in Tables 2 and 3